Disease Prediction based on symptoms using NLP¶
Importing libraries¶
In [ ]:
import os
import zipfile
import pickle
import pandas as pd
import seaborn as sns
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
import matplotlib.pyplot as plt
%matplotlib inline
# To show all the rows of pandas dataframe
pd.set_option('display.max_rows', None)
# To set the width of the column to maximum
pd.set_option('display.max_colwidth', None)
Importing dataset¶
In [ ]:
def unzip_file(zip_path, extract_to):
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall(extract_to)
In [ ]:
zip_file_path = 'dataset-zip.zip'
extraction_path = os.getcwd()
unzip_file(zip_file_path, extraction_path)
In [ ]:
train=pd.read_csv('dataset/drugsComTrain_raw.csv')
test=pd.read_csv('dataset/drugsComTest_raw.csv')
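# ignore_index=True gives the combined frame a fresh 0..n-1 index, so row labels are unique
df = pd.concat([train, test], ignore_index=True)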
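A note on the concatenation step: each CSV is read with its own default index starting at 0, so stacking the two frames can leave repeated row labels unless the index is rebuilt. The sketch below uses two tiny made-up frames (`train_toy` and `test_toy` are hypothetical stand-ins, not the real data) to show the difference.

```python
import pandas as pd

# Toy stand-ins for the train and test frames (hypothetical data).
train_toy = pd.DataFrame({"condition": ["ADHD", "Acne"]})
test_toy = pd.DataFrame({"condition": ["Pain", "Cough"]})

# A plain concat keeps each frame's own labels, so 0 and 1 appear twice.
stacked = pd.concat([train_toy, test_toy])
has_repeated_labels = stacked.index.has_duplicates

# ignore_index=True assigns a fresh 0..n-1 index instead.
renumbered = pd.concat([train_toy, test_toy], ignore_index=True)

print(has_repeated_labels)     # True
print(list(renumbered.index))  # [0, 1, 2, 3]
```

With repeated labels, a lookup such as `stacked.loc[0]` returns two rows rather than one, which matters later when single reviews are inspected by label.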
In [ ]:
# df = pd.read_csv('dataset/mydrugsmerged.csv')
Exploratory Data Analysis¶
In [ ]:
df.shape
Out[ ]:
(215063, 7)
In [ ]:
df.head()
Out[ ]:
| | uniqueID | drugName | condition | review | rating | date | usefulCount |
|---|---|---|---|---|---|---|---|
| 0 | 206461 | Valsartan | Left Ventricular Dysfunction | "It has no side effect, I take it in combination of Bystolic 5 Mg and Fish Oil" | 9 | May 20, 2012 | 27 |
| 1 | 95260 | Guanfacine | ADHD | "My son is halfway through his fourth week of Intuniv. We became concerned when he began this last week, when he started taking the highest dose he will be on. For two days, he could hardly get out of bed, was very cranky, and slept for nearly 8 hours on a drive home from school vacation (very unusual for him.) I called his doctor on Monday morning and she said to stick it out a few days. See how he did at school, and with getting up in the morning. The last two days have been problem free. He is MUCH more agreeable than ever. He is less emotional (a good thing), less cranky. He is remembering all the things he should. Overall his behavior is better. \nWe have tried many different medications and so far this is the most effective." | 8 | April 27, 2010 | 192 |
| 2 | 92703 | Lybrel | Birth Control | "I used to take another oral contraceptive, which had 21 pill cycle, and was very happy- very light periods, max 5 days, no other side effects. But it contained hormone gestodene, which is not available in US, so I switched to Lybrel, because the ingredients are similar. When my other pills ended, I started Lybrel immediately, on my first day of period, as the instructions said. And the period lasted for two weeks. When taking the second pack- same two weeks. And now, with third pack things got even worse- my third period lasted for two weeks and now it's the end of the third week- I still have daily brown discharge.\nThe positive side is that I didn't have any other side effects. The idea of being period free was so tempting... Alas." | 5 | December 14, 2009 | 17 |
| 3 | 138000 | Ortho Evra | Birth Control | "This is my first time using any form of birth control. I'm glad I went with the patch, I have been on it for 8 months. At first It decreased my libido but that subsided. The only downside is that it made my periods longer (5-6 days to be exact) I used to only have periods for 3-4 days max also made my cramps intense for the first two days of my period, I never had cramps before using birth control. Other than that in happy with the patch" | 8 | November 3, 2015 | 10 |
| 4 | 35696 | Buprenorphine / naloxone | Opiate Dependence | "Suboxone has completely turned my life around. I feel healthier, I'm excelling at my job and I always have money in my pocket and my savings account. I had none of those before Suboxone and spent years abusing oxycontin. My paycheck was already spent by the time I got it and I started resorting to scheming and stealing to fund my addiction. All that is history. If you're ready to stop, there's a good chance that suboxone will put you on the path of great life again. I have found the side-effects to be minimal compared to oxycontin. I'm actually sleeping better. Slight constipation is about it for me. It truly is amazing. The cost pales in comparison to what I spent on oxycontin." | 9 | November 27, 2016 | 37 |
In [ ]:
df.condition.value_counts().head(20)
Out[ ]:
condition
Birth Control                38436
Depression                   12164
Pain                          8245
Anxiety                       7812
Acne                          7435
Bipolar Disorde               5604
Insomnia                      4904
Weight Loss                   4857
Obesity                       4757
ADHD                          4509
Diabetes, Type 2              3362
Emergency Contraception       3290
High Blood Pressure           3104
Vaginal Yeast Infection       3085
Abnormal Uterine Bleeding     2744
Bowel Preparation             2498
Smoking Cessation             2440
ibromyalgia                   2370
Migraine                      2277
Anxiety and Stress            2236
Name: count, dtype: int64
In [ ]:
# column_to_count = df['review'] # Replace 'column_name' with the actual column name
# # Step 3: Calculate the total number of words
# total_words = column_to_count.str.split().apply(len).sum()
# print(f"Total number of words in the {column_to_count} column: {total_words}")
In [ ]:
# exploring unique elements in the dataset
print("Number of Unique Drugs present in the Dataset : ", df['drugName'].nunique())
print("Number of Unique Medical Conditions present in the Dataset : ", df['condition'].nunique())
Number of Unique Drugs present in the Dataset : 3671
Number of Unique Medical Conditions present in the Dataset : 916
In [ ]:
# checking for missing values
df.isna().sum()
Out[ ]:
uniqueID          0
drugName          0
condition      1194
review            0
rating            0
date              0
usefulCount       0
dtype: int64
In [ ]:
# dropping the missing values of the conditions
df = df.dropna()
df.isna().sum()
Out[ ]:
uniqueID       0
drugName       0
condition      0
review         0
rating         0
date           0
usefulCount    0
dtype: int64
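Only `condition` has missing values here, so `df.dropna()` removes exactly the rows with no condition. On a frame where other columns also have gaps, the same call would remove more rows than intended. A minimal sketch on made-up data (the `note` column is hypothetical) shows how `subset` restricts the check to one column:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "condition": ["Acne", np.nan, "Pain"],
    "review": ["cleared up", "no change", "helped"],
    "note": [np.nan, "extra", "extra"],  # hypothetical column with its own gap
})

# dropna() with no arguments drops any row that has a NaN in any column.
dropped_any = toy.dropna()

# subset=["condition"] drops only rows where the condition itself is missing.
dropped_condition = toy.dropna(subset=["condition"])

print(len(dropped_any), len(dropped_condition))  # 1 2
```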
In [ ]:
df['drugName'].value_counts().head(10)
Out[ ]:
drugName
Levonorgestrel                        4896
Etonogestrel                          4402
Ethinyl estradiol / norethindrone     3619
Nexplanon                             2892
Ethinyl estradiol / norgestimate      2682
Ethinyl estradiol / levonorgestrel    2400
Phentermine                           2077
Sertraline                            1859
Escitalopram                          1739
Mirena                                1673
Name: count, dtype: int64
In [ ]:
df['condition'].value_counts().head(20).plot(kind='barh', figsize=(10, 4))
Out[ ]:
<Axes: ylabel='condition'>
In [ ]:
import warnings
warnings.filterwarnings("ignore")
plt.rcParams['figure.figsize'] = (10, 4)
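# histplot with a KDE overlay replaces the deprecated distplot
plt.subplot(1, 2, 1)
sns.histplot(df['rating'], kde=True, stat='density')
plt.subplot(1, 2, 2)
sns.histplot(df['usefulCount'], kde=True, stat='density')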
plt.suptitle('Distribution of Rating and Useful Count \n ', fontsize = 20)
plt.show()
In [ ]:
import seaborn as sns
# Define the custom linear gradient color palette
custom_palette = sns.color_palette("RdYlGn", 10)
plt.rcParams['figure.figsize'] = (10, 4)
sns.barplot(x=df['rating'], y=df['usefulCount'], palette=custom_palette)
plt.xlabel('\n Ratings')
plt.ylabel('Rated Useful Count\n', fontsize=20)
plt.title('\n Rating vs Usefulness \n', fontsize=20)
plt.show()
In [ ]:
# Aggregate the data to count the number of occurrences for each drug-condition pair
drug_condition_count = df.groupby(['condition', 'drugName']).size().reset_index(name='count')
# Sort the data based on the count to see the most common combinations
sorted_drug_condition_count = drug_condition_count.sort_values(by='count', ascending=False)
# Display the top 20 drug-condition pairs for a sense of the most common combinations
sorted_drug_condition_count.head(20)
Out[ ]:
| | condition | drugName | count |
|---|---|---|---|
| 2185 | Birth Control | Etonogestrel | 4394 |
| 2182 | Birth Control | Ethinyl estradiol / norethindrone | 3081 |
| 2215 | Birth Control | Levonorgestrel | 2884 |
| 2251 | Birth Control | Nexplanon | 2883 |
| 2180 | Birth Control | Ethinyl estradiol / levonorgestrel | 2107 |
| 2183 | Birth Control | Ethinyl estradiol / norgestimate | 2097 |
| 3959 | Emergency Contraception | Levonorgestrel | 1651 |
| 9276 | Weight Loss | Phentermine | 1650 |
| 2196 | Birth Control | Implanon | 1496 |
| 9159 | Vaginal Yeast Infection | Miconazole | 1338 |
| 2244 | Birth Control | Mirena | 1320 |
| 8566 | Smoking Cessation | Varenicline | 1079 |
| 2287 | Birth Control | Skyla | 1074 |
| 9172 | Vaginal Yeast Infection | Tioconazole | 980 |
| 2221 | Birth Control | Lo Loestrin Fe | 896 |
| 6525 | Obesity | Bupropion / naltrexone | 888 |
| 6527 | Obesity | Contrave | 864 |
| 8553 | Smoking Cessation | Chantix | 857 |
| 2178 | Birth Control | Ethinyl estradiol / etonogestrel | 827 |
| 2259 | Birth Control | NuvaRing | 824 |
In [ ]:
# To make a meaningful graph, let's first aggregate the counts of reviews per condition
condition_count = df['condition'].value_counts().reset_index()
condition_count.columns = ['condition', 'count']
# Select the top 10 conditions to keep the graph interpretable
top_conditions = condition_count.head(10)
# Plotting the top conditions by their count of reviews
plt.figure(figsize=(12, 8))
sns.barplot(x='count', y='condition', data=top_conditions, palette='coolwarm')
plt.title('Top 10 Conditions by Number of Reviews')
plt.xlabel('Number of Reviews')
plt.ylabel('Condition')
plt.show()
In [ ]:
import networkx as nx
import matplotlib.pyplot as plt
from networkx.drawing.nx_agraph import graphviz_layout
# Reduce the dataset to a manageable size by focusing on conditions with the most reviews
top_conditions_list = top_conditions['condition'].tolist()
reduced_df = df[df['condition'].isin(top_conditions_list)]
# Create a graph
G = nx.Graph()
# Add nodes and edges from the reduced dataset
for index, row in reduced_df.iterrows():
    condition = row['condition']
    drug = row['drugName']
    G.add_node(condition, type='condition')
    G.add_node(drug, type='drug')
    G.add_edge(condition, drug)
# Draw the network graph with labels for nodes to identify conditions and medications
plt.figure(figsize=(16, 12))
pos = nx.spring_layout(G, k=0.5, iterations=20)
# Nodes
condition_nodes = [n for n in G.nodes if G.nodes[n]['type'] == 'condition']
drug_nodes = [n for n in G.nodes if G.nodes[n]['type'] == 'drug']
nx.draw_networkx_nodes(G, pos, nodelist=condition_nodes, node_size=20, node_color='skyblue', alpha=0.6, label='condition')
nx.draw_networkx_nodes(G, pos, nodelist=drug_nodes, node_size=20, node_color='lightgreen', alpha=0.6, label='drug')
# Edges
nx.draw_networkx_edges(G, pos, alpha=0.4)
# Labels
nx.draw_networkx_labels(G, pos, font_size=7, font_color='darkblue')
plt.title('Enhanced Network Graph of Conditions and Medications with Labels')
plt.axis('off')
plt.show()
In [ ]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt
# Data for bar chart (top conditions and drugs)
top_conditions_data = df['condition'].value_counts().head(10)
top_drugs_data = df['drugName'].value_counts().head(10)
# Placeholder values for the commented-out performance summary below (not measured results)
accuracy = 0.85
precision = 0.8
recall = 0.75
f1_score = 0.77
# WordCloud for conditions
wordcloud_conditions = WordCloud(background_color='white', width=400, height=400).generate(' '.join(df['condition'].dropna().unique()))
# Bar chart for conditions
plt.figure(figsize=(8, 5))
top_conditions_data.plot(kind='barh', color='skyblue')
plt.title('Top 10 Conditions')
plt.tight_layout()
plt.show()
# Bar chart for drugs
plt.figure(figsize=(8, 5))
top_drugs_data.plot(kind='barh', color='lightgreen')
plt.title('Top 10 Drugs')
plt.tight_layout()
plt.show()
# Model performance summary
# plt.figure(figsize=(8, 5))
# plt.text(0.5, 0.8, f'Accuracy: {accuracy}', ha='center')
# plt.text(0.5, 0.6, f'Precision: {precision}', ha='center')
# plt.text(0.5, 0.4, f'Recall: {recall}', ha='center')
# plt.text(0.5, 0.2, f'F1 Score: {f1_score}', ha='center')
# plt.axis('off')
# plt.title('Model Performance')
# plt.tight_layout()
# plt.show()
# WordCloud for conditions
plt.figure(figsize=(8, 8))
plt.imshow(wordcloud_conditions, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud of Conditions')
plt.tight_layout()
plt.show()
#make labels bold
In [ ]:
wordcloud_medications = WordCloud(background_color='white', width=400, height=400).generate(' '.join(df['drugName'].dropna().unique()))
plt.figure(figsize=(8, 8))
plt.imshow(wordcloud_medications, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud of Medications')
plt.tight_layout()
plt.show()
In [ ]:
df_birth=df[(df['condition']=='Birth Control')]
df_dep=df[(df['condition']=='Depression')]
df_bp=df[(df['condition']=='High Blood Pressure')]
df_diab=df[(df['condition']=='Diabetes, Type 2')]
In [ ]:
from wordcloud import WordCloud
plt.figure(figsize = (20,20)) # Reviews for the Birth Control condition
wc = WordCloud(max_words = 500 , width = 1600 , height = 800).generate(" ".join(df_birth.review))
plt.imshow(wc , interpolation = 'bilinear')
plt.title('Word cloud for Birth control',fontsize=14)
Out[ ]:
Text(0.5, 1.0, 'Word cloud for Birth control')
In [ ]:
plt.figure(figsize = (20,20)) # Reviews for the Depression condition
wc = WordCloud(max_words = 500 , width = 1600 , height = 800).generate(" ".join(df_dep.review))
plt.imshow(wc , interpolation = 'bilinear')
plt.title('Word cloud for Depression',fontsize=14)
Out[ ]:
Text(0.5, 1.0, 'Word cloud for Depression')
In [ ]:
plt.figure(figsize = (20,20)) # Reviews for the High Blood Pressure condition
wc = WordCloud(max_words = 500 , width = 1600 , height = 800).generate(" ".join(df_bp.review))
plt.imshow(wc , interpolation = 'bilinear')
plt.title('Word cloud for High Blood Pressure',fontsize=14)
Out[ ]:
Text(0.5, 1.0, 'Word cloud for High Blood Pressure')
In [ ]:
plt.figure(figsize = (20,20)) # Reviews for the Diabetes, Type 2 condition
wc = WordCloud(max_words = 500 , width = 1600 , height = 800).generate(" ".join(df_diab.review))
plt.imshow(wc , interpolation = 'bilinear')
plt.title('Word cloud for Diabetes Type 2',fontsize=14)
Out[ ]:
Text(0.5, 1.0, 'Word cloud for Diabetes Type 2')
Data Preprocessing¶
In [ ]:
# Keep only the reviews for the conditions the classifier will predict
target_conditions = [
    'Birth Control', 'Depression', 'High Blood Pressure', 'Diabetes, Type 2',
    'Insomnia', 'GERD', 'Cough', 'Acne', 'Anxiety', 'Constipation', 'Migraine',
]
df_train = df[df['condition'].isin(target_conditions)]
In [ ]:
df_train.shape
Out[ ]:
(83806, 7)
In [ ]:
X = df_train.drop(['uniqueID','drugName','rating','date','usefulCount'],axis=1)
In [ ]:
X.shape
Out[ ]:
(83806, 2)
In [ ]:
X.condition.value_counts()
Out[ ]:
condition
Birth Control          38436
Depression             12164
Anxiety                 7812
Acne                    7435
Insomnia                4904
Diabetes, Type 2        3362
High Blood Pressure     3104
Migraine                2277
Constipation            2120
Cough                   1224
GERD                     968
Name: count, dtype: int64
In [ ]:
X.isna().sum()
Out[ ]:
condition    0
review       0
dtype: int64
In [ ]:
#check for duplicate values
X.duplicated().sum()
Out[ ]:
35975
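The check above reports 35975 duplicated (condition, review) rows, but the notebook does not remove them. If they stay in, identical review text can end up on both sides of a later train/test split, which can inflate the measured scores. A minimal sketch of counting and dropping duplicates, on made-up rows rather than the real data:

```python
import pandas as pd

toy = pd.DataFrame({
    "condition": ["Birth Control", "Birth Control", "Acne", "Acne"],
    "review": ["worked well", "worked well", "dry skin", "no change"],
})

# duplicated() flags every row that repeats an earlier row exactly.
n_dupes = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each distinct row.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_dupes, len(deduped))  # 1 3
```

Whether to drop them here is a judgment call; the sketch only shows the mechanics.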
In [ ]:
X.head()
Out[ ]:
| | condition | review |
|---|---|---|
| 2 | Birth Control | "I used to take another oral contraceptive, which had 21 pill cycle, and was very happy- very light periods, max 5 days, no other side effects. But it contained hormone gestodene, which is not available in US, so I switched to Lybrel, because the ingredients are similar. When my other pills ended, I started Lybrel immediately, on my first day of period, as the instructions said. And the period lasted for two weeks. When taking the second pack- same two weeks. And now, with third pack things got even worse- my third period lasted for two weeks and now it's the end of the third week- I still have daily brown discharge.\nThe positive side is that I didn't have any other side effects. The idea of being period free was so tempting... Alas." |
| 3 | Birth Control | "This is my first time using any form of birth control. I'm glad I went with the patch, I have been on it for 8 months. At first It decreased my libido but that subsided. The only downside is that it made my periods longer (5-6 days to be exact) I used to only have periods for 3-4 days max also made my cramps intense for the first two days of my period, I never had cramps before using birth control. Other than that in happy with the patch" |
| 9 | Birth Control | "I had been on the pill for many years. When my doctor changed my RX to chateal, it was as effective. It really did help me by completely clearing my acne, this takes about 6 months though. I did not gain extra weight, or develop any emotional health issues. I stopped taking it bc I started using a more natural method of birth control, but started to take it bc I hate that my acne came back at age 28. I really hope symptoms like depression, or weight gain do not begin to affect me as I am older now. I'm also naturally moody, so this may worsen things. I was in a negative mental rut today. Also I hope this doesn't push me over the edge, as I believe I am depressed. Hopefully it'll be just like when I was younger." |
| 11 | Depression | "I have taken anti-depressants for years, with some improvement but mostly moderate to severe side affects, which makes me go off them.\n\nI only take Cymbalta now mostly for pain.\n\nWhen I began Deplin, I noticed a major improvement overnight. More energy, better disposition, and no sinking to the low lows of major depression. I have been taking it for about 3 months now and feel like a normal person for the first time ever. Best thing, no side effects." |
| 13 | Cough | "Have a little bit of a lingering cough from a cold. Not giving me much trouble except keeps me up at night. I heard this was good so I took so I could get some sleep. Helped tremendously with the cough but then I was having bad stomach cramps and diarrhea. I hadn't eaten anything that should have upset my stomach and it didn't really feel like a "bug" so I looked up side effects for Delsym. Now I wish I had done that first because I probably wouldn't have taken it. So, while it worked for my cough I still didn't get any sleep due to the stomach issues." |
In [ ]:
X['review'][2]
Out[ ]:
'"I used to take another oral contraceptive, which had 21 pill cycle, and was very happy- very light periods, max 5 days, no other side effects. But it contained hormone gestodene, which is not available in US, so I switched to Lybrel, because the ingredients are similar. When my other pills ended, I started Lybrel immediately, on my first day of period, as the instructions said. And the period lasted for two weeks. When taking the second pack- same two weeks. And now, with third pack things got even worse- my third period lasted for two weeks and now it's the end of the third week- I still have daily brown discharge.\nThe positive side is that I didn't have any other side effects. The idea of being period free was so tempting... Alas."'
In [ ]:
X['review'][11]
Out[ ]:
'"I have taken anti-depressants for years, with some improvement but mostly moderate to severe side affects, which makes me go off them.\n\nI only take Cymbalta now mostly for pain.\n\nWhen I began Deplin, I noticed a major improvement overnight. More energy, better disposition, and no sinking to the low lows of major depression. I have been taking it for about 3 months now and feel like a normal person for the first time ever. Best thing, no side effects."'
In [ ]:
for i, col in enumerate(X.columns):
    X.iloc[:, i] = X.iloc[:, i].str.replace('"', '')
In [ ]:
X['review'][11]
Out[ ]:
'I have taken anti-depressants for years, with some improvement but mostly moderate to severe side affects, which makes me go off them.\n\nI only take Cymbalta now mostly for pain.\n\nWhen I began Deplin, I noticed a major improvement overnight. More energy, better disposition, and no sinking to the low lows of major depression. I have been taking it for about 3 months now and feel like a normal person for the first time ever. Best thing, no side effects.'
In [ ]:
X.head()
Out[ ]:
| | condition | review |
|---|---|---|
| 2 | Birth Control | I used to take another oral contraceptive, which had 21 pill cycle, and was very happy- very light periods, max 5 days, no other side effects. But it contained hormone gestodene, which is not available in US, so I switched to Lybrel, because the ingredients are similar. When my other pills ended, I started Lybrel immediately, on my first day of period, as the instructions said. And the period lasted for two weeks. When taking the second pack- same two weeks. And now, with third pack things got even worse- my third period lasted for two weeks and now it's the end of the third week- I still have daily brown discharge.\nThe positive side is that I didn't have any other side effects. The idea of being period free was so tempting... Alas. |
| 3 | Birth Control | This is my first time using any form of birth control. I'm glad I went with the patch, I have been on it for 8 months. At first It decreased my libido but that subsided. The only downside is that it made my periods longer (5-6 days to be exact) I used to only have periods for 3-4 days max also made my cramps intense for the first two days of my period, I never had cramps before using birth control. Other than that in happy with the patch |
| 9 | Birth Control | I had been on the pill for many years. When my doctor changed my RX to chateal, it was as effective. It really did help me by completely clearing my acne, this takes about 6 months though. I did not gain extra weight, or develop any emotional health issues. I stopped taking it bc I started using a more natural method of birth control, but started to take it bc I hate that my acne came back at age 28. I really hope symptoms like depression, or weight gain do not begin to affect me as I am older now. I'm also naturally moody, so this may worsen things. I was in a negative mental rut today. Also I hope this doesn't push me over the edge, as I believe I am depressed. Hopefully it'll be just like when I was younger. |
| 11 | Depression | I have taken anti-depressants for years, with some improvement but mostly moderate to severe side affects, which makes me go off them.\n\nI only take Cymbalta now mostly for pain.\n\nWhen I began Deplin, I noticed a major improvement overnight. More energy, better disposition, and no sinking to the low lows of major depression. I have been taking it for about 3 months now and feel like a normal person for the first time ever. Best thing, no side effects. |
| 13 | Cough | Have a little bit of a lingering cough from a cold. Not giving me much trouble except keeps me up at night. I heard this was good so I took so I could get some sleep. Helped tremendously with the cough but then I was having bad stomach cramps and diarrhea. I hadn't eaten anything that should have upset my stomach and it didn't really feel like a "bug" so I looked up side effects for Delsym. Now I wish I had done that first because I probably wouldn't have taken it. So, while it worked for my cough I still didn't get any sleep due to the stomach issues. |
Stopwords¶
In [ ]:
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\suhas\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\suhas\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
Out[ ]:
True
In [ ]:
from nltk.corpus import stopwords
stop = stopwords.words('english')
In [ ]:
stop
Out[ ]:
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
In [ ]:
#https://www.kaggle.com/sudalairajkumar/simple-exploration-notebook-qiqc kernel
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
# Thanks : https://www.kaggle.com/aashita/word-clouds-of-various-shapes ##
def plot_wordcloud(text, mask=None, max_words=200, max_font_size=100, figure_size=(24.0,16.0),
                   title = None, title_size=40, image_color=False):
    # Combine the wordcloud package's stop list with a few extra tokens
    stopwords = set(STOPWORDS)
    more_stopwords = {'one', 'br', 'Po', 'th', 'sayi', 'fo', 'Unknown'}
    stopwords = stopwords.union(more_stopwords)
    wordcloud = WordCloud(background_color='white',
                          stopwords = stopwords,
                          max_words = max_words,
                          max_font_size = max_font_size,
                          random_state = 42,
                          width=800,
                          height=400,
                          mask = mask)
    wordcloud.generate(str(text))
    plt.figure(figsize=figure_size)
    if image_color:
        # Recolor the cloud using the colors of the mask image
        image_colors = ImageColorGenerator(mask)
        plt.imshow(wordcloud.recolor(color_func=image_colors), interpolation="bilinear")
        plt.title(title, fontdict={'size': title_size,
                                   'verticalalignment': 'bottom'})
    else:
        plt.imshow(wordcloud)
        plt.title(title, fontdict={'size': title_size, 'color': 'black',
                                   'verticalalignment': 'bottom'})
    plt.axis('off')
    plt.tight_layout()
plot_wordcloud(stop, title="Word Cloud of stops")
In [ ]:
not_stop = ["aren't","couldn't","didn't","doesn't","don't","hadn't","hasn't","haven't","isn't","mightn't","mustn't","needn't","no","nor","not","shan't","shouldn't","wasn't","weren't","wouldn't"]
# Keep negation words in the text by removing them from the stop list
for i in not_stop:
    stop.remove(i)
Lemmatization¶
In [ ]:
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer
porter = PorterStemmer()
lemmatizer = WordNetLemmatizer()
In [ ]:
print(porter.stem("sportingly"))
print(porter.stem("very"))
print(porter.stem("troubled"))
sportingli
veri
troubl
In [ ]:
print(lemmatizer.lemmatize("sportingly"))
print(lemmatizer.lemmatize("very"))
print(lemmatizer.lemmatize("troubled"))
sportingly
very
troubled
Review Cleaning¶
In [ ]:
from bs4 import BeautifulSoup
import re
In [ ]:
def review_to_words(raw_review):
    # 1. Strip HTML markup
    review_text = BeautifulSoup(raw_review, 'html.parser').get_text()
    # 2. Replace every non-letter character with a space
    letters_only = re.sub('[^a-zA-Z]', ' ', review_text)
    # 3. Lowercase and split into words
    words = letters_only.lower().split()
    # 4. Remove stopwords
    meaningful_words = [w for w in words if w not in stop]
    # 5. Lemmatize each remaining word
    lemmatized_words = [lemmatizer.lemmatize(w) for w in meaningful_words]
    # 6. Join the words back into a single space-separated string
    return ' '.join(lemmatized_words)
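One side effect of step 2 is worth seeing concretely. Replacing every non-letter with a space also splits contractions at the apostrophe, so "didn't" reaches the stop-word filter as "didn" and "t". Both fragments are in the NLTK list printed above, so the negation is still dropped even though "didn't" itself was taken out of the stop list. The sketch below mirrors steps 2 and 3 only; `tokenize` and `toy_stop` are hypothetical names used for illustration.

```python
import re

def tokenize(raw):
    """Mirror steps 2 and 3 of review_to_words: keep letters, lowercase, split."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", raw)
    return letters_only.lower().split()

# Hypothetical mini stop list: 'didn' and 't' are copied from the NLTK list,
# while "didn't" is absent, as it is after the not_stop loop has run.
toy_stop = {"it", "didn", "t"}

tokens = tokenize("It didn't help")
kept = [t for t in tokens if t not in toy_stop]

print(tokens)  # ['it', 'didn', 't', 'help']
print(kept)    # ['help']
```

If keeping these negations matters, one possible adjustment (an assumption, not something the notebook does) is to allow apostrophes through the regular expression, for example `[^a-zA-Z']`, so that "didn't" survives as a single token.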
In [ ]:
X['review_clean'] = X['review'].apply(review_to_words)
In [ ]:
X.head()
Out[ ]:
| | condition | review | review_clean |
|---|---|---|---|
| 2 | Birth Control | I used to take another oral contraceptive, which had 21 pill cycle, and was very happy- very light periods, max 5 days, no other side effects. But it contained hormone gestodene, which is not available in US, so I switched to Lybrel, because the ingredients are similar. When my other pills ended, I started Lybrel immediately, on my first day of period, as the instructions said. And the period lasted for two weeks. When taking the second pack- same two weeks. And now, with third pack things got even worse- my third period lasted for two weeks and now it's the end of the third week- I still have daily brown discharge.\nThe positive side is that I didn't have any other side effects. The idea of being period free was so tempting... Alas. | used take another oral contraceptive pill cycle happy light period max day no side effect contained hormone gestodene not available u switched lybrel ingredient similar pill ended started lybrel immediately first day period instruction said period lasted two week taking second pack two week third pack thing got even worse third period lasted two week end third week still daily brown discharge positive side side effect idea period free tempting ala |
| 3 | Birth Control | This is my first time using any form of birth control. I'm glad I went with the patch, I have been on it for 8 months. At first It decreased my libido but that subsided. The only downside is that it made my periods longer (5-6 days to be exact) I used to only have periods for 3-4 days max also made my cramps intense for the first two days of my period, I never had cramps before using birth control. Other than that in happy with the patch | first time using form birth control glad went patch month first decreased libido subsided downside made period longer day exact used period day max also made cramp intense first two day period never cramp using birth control happy patch |
| 9 | Birth Control | I had been on the pill for many years. When my doctor changed my RX to chateal, it was as effective. It really did help me by completely clearing my acne, this takes about 6 months though. I did not gain extra weight, or develop any emotional health issues. I stopped taking it bc I started using a more natural method of birth control, but started to take it bc I hate that my acne came back at age 28. I really hope symptoms like depression, or weight gain do not begin to affect me as I am older now. I'm also naturally moody, so this may worsen things. I was in a negative mental rut today. Also I hope this doesn't push me over the edge, as I believe I am depressed. Hopefully it'll be just like when I was younger. | pill many year doctor changed rx chateal effective really help completely clearing acne take month though not gain extra weight develop emotional health issue stopped taking bc started using natural method birth control started take bc hate acne came back age really hope symptom like depression weight gain not begin affect older also naturally moody may worsen thing negative mental rut today also hope push edge believe depressed hopefully like younger |
| 11 | Depression | I have taken anti-depressants for years, with some improvement but mostly moderate to severe side affects, which makes me go off them.\n\nI only take Cymbalta now mostly for pain.\n\nWhen I began Deplin, I noticed a major improvement overnight. More energy, better disposition, and no sinking to the low lows of major depression. I have been taking it for about 3 months now and feel like a normal person for the first time ever. Best thing, no side effects. | taken anti depressant year improvement mostly moderate severe side affect make go take cymbalta mostly pain began deplin noticed major improvement overnight energy better disposition no sinking low low major depression taking month feel like normal person first time ever best thing no side effect |
| 13 | Cough | Have a little bit of a lingering cough from a cold. Not giving me much trouble except keeps me up at night. I heard this was good so I took so I could get some sleep. Helped tremendously with the cough but then I was having bad stomach cramps and diarrhea. I hadn't eaten anything that should have upset my stomach and it didn't really feel like a "bug" so I looked up side effects for Delsym. Now I wish I had done that first because I probably wouldn't have taken it. So, while it worked for my cough I still didn't get any sleep due to the stomach issues. | little bit lingering cough cold not giving much trouble except keep night heard good took could get sleep helped tremendously cough bad stomach cramp diarrhea eaten anything upset stomach really feel like bug looked side effect delsym wish done first probably taken worked cough still get sleep due stomach issue |
In [ ]:
X['review_clean'][11]
Out[ ]:
'taken anti depressant year improvement mostly moderate severe side affect make go take cymbalta mostly pain began deplin noticed major improvement overnight energy better disposition no sinking low low major depression taking month feel like normal person first time ever best thing no side effect'
Creating features and Target Variable¶
In [ ]:
X_feat = X['review_clean']
y = X['condition']
In [ ]:
X_train, X_test, y_train, y_test = train_test_split(X_feat, y,stratify=y,test_size=0.2, random_state=0)
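The `stratify=y` argument above asks for a split in which each condition keeps roughly the same share of the train and test sets. The following is a simplified, standard-library sketch of that idea, not scikit-learn's own algorithm; the function name `stratified_split` and the toy labels are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=0):
    """Split row indices so each class contributes ~test_size of its rows to the test set."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for y in sorted(by_class):
        idx = by_class[y]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_size)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels (hypothetical): 10 "Acne" rows and 5 "Cough" rows
labels = ["Acne"] * 10 + ["Cough"] * 5
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Without stratification, a rare condition could end up almost entirely in one of the two sets, which would make the test accuracy for that condition unreliable.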
Bag of Words¶
In [ ]:
count_vectorizer = CountVectorizer(stop_words='english')
count_train = count_vectorizer.fit_transform(X_train)
count_test = count_vectorizer.transform(X_test)
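To make the bag-of-words step concrete, here is a minimal standard-library sketch of what a count matrix contains. It mimics `CountVectorizer`'s default tokenization (lowercasing, tokens of two or more word characters) but omits stop-word removal; the helper names and the two toy reviews are assumptions for illustration only.

```python
import re
from collections import Counter

# Lowercase, then keep runs of two or more word characters
TOKEN_RE = re.compile(r"\b\w\w+\b")


def tokenize(text):
    return TOKEN_RE.findall(text.lower())


def bag_of_words(docs):
    """Return (sorted vocabulary, one row of term counts per document)."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    vocab = sorted(set().union(*counts))
    rows = [[c[term] for term in vocab] for c in counts]
    return vocab, rows


docs = ["no side effect", "bad side effect, bad cramp"]  # toy reviews
vocab, rows = bag_of_words(docs)
print(vocab)
for row in rows:
    print(row)
```

Each column is one vocabulary term and each row is one review, so word order is discarded and only term frequencies remain.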
In [ ]:
from sklearn.feature_extraction.text import CountVectorizer
# Bag-of-words demo on the first 10 raw reviews
sentence = df.head(10)['review'].tolist()
# Initialize CountVectorizer; stop words are deliberately kept for this demo
vectorizer = CountVectorizer()
# Fit the vectorizer to the data
bag_of_words = vectorizer.fit_transform(sentence)
# Get the feature names to use as columns in a DataFrame
feature_names = vectorizer.get_feature_names_out()
# Create a DataFrame for the bag of words matrix
df_bag_of_words = pd.DataFrame(bag_of_words.toarray(), columns=feature_names)
# Display the DataFrame
df_bag_of_words
Out[ ]:
| | 039 | 15 | 21 | 230 | 26 | 28 | 2mg | 2nd | 2x | 3rd | ... | work | works | worse | worsen | worth | would | years | you | younger | zoloft |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 0 | 0 |
| 5 | 2 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | ... | 2 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 |
| 6 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9 | 3 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 |
10 rows × 386 columns
In [ ]:
df.head(10)['review'].tolist()
Out[ ]:
['"It has no side effect, I take it in combination of Bystolic 5 Mg and Fish Oil"', '"My son is halfway through his fourth week of Intuniv. We became concerned when he began this last week, when he started taking the highest dose he will be on. For two days, he could hardly get out of bed, was very cranky, and slept for nearly 8 hours on a drive home from school vacation (very unusual for him.) I called his doctor on Monday morning and she said to stick it out a few days. See how he did at school, and with getting up in the morning. The last two days have been problem free. He is MUCH more agreeable than ever. He is less emotional (a good thing), less cranky. He is remembering all the things he should. Overall his behavior is better. \nWe have tried many different medications and so far this is the most effective."', '"I used to take another oral contraceptive, which had 21 pill cycle, and was very happy- very light periods, max 5 days, no other side effects. But it contained hormone gestodene, which is not available in US, so I switched to Lybrel, because the ingredients are similar. When my other pills ended, I started Lybrel immediately, on my first day of period, as the instructions said. And the period lasted for two weeks. When taking the second pack- same two weeks. And now, with third pack things got even worse- my third period lasted for two weeks and now it's the end of the third week- I still have daily brown discharge.\nThe positive side is that I didn't have any other side effects. The idea of being period free was so tempting... Alas."', '"This is my first time using any form of birth control. I'm glad I went with the patch, I have been on it for 8 months. At first It decreased my libido but that subsided. The only downside is that it made my periods longer (5-6 days to be exact) I used to only have periods for 3-4 days max also made my cramps intense for the first two days of my period, I never had cramps before using birth control. 
Other than that in happy with the patch"', '"Suboxone has completely turned my life around. I feel healthier, I'm excelling at my job and I always have money in my pocket and my savings account. I had none of those before Suboxone and spent years abusing oxycontin. My paycheck was already spent by the time I got it and I started resorting to scheming and stealing to fund my addiction. All that is history. If you're ready to stop, there's a good chance that suboxone will put you on the path of great life again. I have found the side-effects to be minimal compared to oxycontin. I'm actually sleeping better. Slight constipation is about it for me. It truly is amazing. The cost pales in comparison to what I spent on oxycontin."', '"2nd day on 5mg started to work with rock hard erections however experianced headache, lower bowel preassure. 3rd day erections would wake me up & hurt! Leg/ankles aches severe lower bowel preassure like you need to go #2 but can't! Enjoyed the initial rockhard erections but not at these side effects or $230 for months supply! I'm 50 & work out 3Xs a week. Not worth side effects!"', '"He pulled out, but he cummed a bit in me. I took the Plan B 26 hours later, and took a pregnancy test two weeks later - - I'm pregnant."', '"Abilify changed my life. There is hope. I was on Zoloft and Clonidine when I first started Abilify at the age of 15.. Zoloft for depression and Clondine to manage my complete rage. My moods were out of control. I was depressed and hopeless one second and then mean, irrational, and full of rage the next. My Dr. prescribed me 2mg of Abilify and from that point on I feel like I have been cured though I know I'm not.. Bi-polar disorder is a constant battle. I know Abilify works for me because I have tried to get off it and lost complete control over my emotions. Went back on it and I was golden again. I am on 5mg 2x daily. I am now 21 and better than I have ever been in the past. 
Only side effect is I like to eat a lot."', '" I Ve had nothing but problems with the Keppera : constant shaking in my arms & legs & pins & needles feeling in my arms & legs severe light headedness no appetite & etc."', '"I had been on the pill for many years. When my doctor changed my RX to chateal, it was as effective. It really did help me by completely clearing my acne, this takes about 6 months though. I did not gain extra weight, or develop any emotional health issues. I stopped taking it bc I started using a more natural method of birth control, but started to take it bc I hate that my acne came back at age 28. I really hope symptoms like depression, or weight gain do not begin to affect me as I am older now. I'm also naturally moody, so this may worsen things. I was in a negative mental rut today. Also I hope this doesn't push me over the edge, as I believe I am depressed. Hopefully it'll be just like when I was younger."']
Directory Creation¶
In [ ]:
# Create the "models" and "vectorizers" folders if they don't exist
os.makedirs("models", exist_ok=True)
os.makedirs("vectorizers", exist_ok=True)
Sentiment Analysis¶
In [ ]:
from textblob import TextBlob
from tqdm import tqdm
reviews = df['review']
Predict_Sentiment = []
for review in tqdm(reviews):
    blob = TextBlob(review)
    Predict_Sentiment.append(blob.sentiment.polarity)
df["Predict_Sentiment"] = Predict_Sentiment
df.head()
100%|██████████| 213869/213869 [03:00<00:00, 1185.50it/s]
Out[ ]:
| | uniqueID | drugName | condition | review | rating | date | usefulCount | Predict_Sentiment |
|---|---|---|---|---|---|---|---|---|
| 0 | 206461 | Valsartan | Left Ventricular Dysfunction | "It has no side effect, I take it in combination of Bystolic 5 Mg and Fish Oil" | 9 | May 20, 2012 | 27 | 0.000000 |
| 1 | 95260 | Guanfacine | ADHD | "My son is halfway through his fourth week of Intuniv. We became concerned when he began this last week, when he started taking the highest dose he will be on. For two days, he could hardly get out of bed, was very cranky, and slept for nearly 8 hours on a drive home from school vacation (very unusual for him.) I called his doctor on Monday morning and she said to stick it out a few days. See how he did at school, and with getting up in the morning. The last two days have been problem free. He is MUCH more agreeable than ever. He is less emotional (a good thing), less cranky. He is remembering all the things he should. Overall his behavior is better. \nWe have tried many different medications and so far this is the most effective." | 8 | April 27, 2010 | 192 | 0.168333 |
| 2 | 92703 | Lybrel | Birth Control | "I used to take another oral contraceptive, which had 21 pill cycle, and was very happy- very light periods, max 5 days, no other side effects. But it contained hormone gestodene, which is not available in US, so I switched to Lybrel, because the ingredients are similar. When my other pills ended, I started Lybrel immediately, on my first day of period, as the instructions said. And the period lasted for two weeks. When taking the second pack- same two weeks. And now, with third pack things got even worse- my third period lasted for two weeks and now it's the end of the third week- I still have daily brown discharge.\nThe positive side is that I didn't have any other side effects. The idea of being period free was so tempting... Alas." | 5 | December 14, 2009 | 17 | 0.067210 |
| 3 | 138000 | Ortho Evra | Birth Control | "This is my first time using any form of birth control. I'm glad I went with the patch, I have been on it for 8 months. At first It decreased my libido but that subsided. The only downside is that it made my periods longer (5-6 days to be exact) I used to only have periods for 3-4 days max also made my cramps intense for the first two days of my period, I never had cramps before using birth control. Other than that in happy with the patch" | 8 | November 3, 2015 | 10 | 0.179545 |
| 4 | 35696 | Buprenorphine / naloxone | Opiate Dependence | "Suboxone has completely turned my life around. I feel healthier, I'm excelling at my job and I always have money in my pocket and my savings account. I had none of those before Suboxone and spent years abusing oxycontin. My paycheck was already spent by the time I got it and I started resorting to scheming and stealing to fund my addiction. All that is history. If you're ready to stop, there's a good chance that suboxone will put you on the path of great life again. I have found the side-effects to be minimal compared to oxycontin. I'm actually sleeping better. Slight constipation is about it for me. It truly is amazing. The cost pales in comparison to what I spent on oxycontin." | 9 | November 27, 2016 | 37 | 0.194444 |
In [ ]:
df['sentiment'] = df["rating"].apply(lambda x: 1 if x > 5 else 0)
df.head()
Out[ ]:
| | uniqueID | drugName | condition | review | rating | date | usefulCount | Predict_Sentiment | sentiment |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 206461 | Valsartan | Left Ventricular Dysfunction | "It has no side effect, I take it in combination of Bystolic 5 Mg and Fish Oil" | 9 | May 20, 2012 | 27 | 0.000000 | 1 |
| 1 | 95260 | Guanfacine | ADHD | "My son is halfway through his fourth week of Intuniv. We became concerned when he began this last week, when he started taking the highest dose he will be on. For two days, he could hardly get out of bed, was very cranky, and slept for nearly 8 hours on a drive home from school vacation (very unusual for him.) I called his doctor on Monday morning and she said to stick it out a few days. See how he did at school, and with getting up in the morning. The last two days have been problem free. He is MUCH more agreeable than ever. He is less emotional (a good thing), less cranky. He is remembering all the things he should. Overall his behavior is better. \nWe have tried many different medications and so far this is the most effective." | 8 | April 27, 2010 | 192 | 0.168333 | 1 |
| 2 | 92703 | Lybrel | Birth Control | "I used to take another oral contraceptive, which had 21 pill cycle, and was very happy- very light periods, max 5 days, no other side effects. But it contained hormone gestodene, which is not available in US, so I switched to Lybrel, because the ingredients are similar. When my other pills ended, I started Lybrel immediately, on my first day of period, as the instructions said. And the period lasted for two weeks. When taking the second pack- same two weeks. And now, with third pack things got even worse- my third period lasted for two weeks and now it's the end of the third week- I still have daily brown discharge.\nThe positive side is that I didn't have any other side effects. The idea of being period free was so tempting... Alas." | 5 | December 14, 2009 | 17 | 0.067210 | 0 |
| 3 | 138000 | Ortho Evra | Birth Control | "This is my first time using any form of birth control. I'm glad I went with the patch, I have been on it for 8 months. At first It decreased my libido but that subsided. The only downside is that it made my periods longer (5-6 days to be exact) I used to only have periods for 3-4 days max also made my cramps intense for the first two days of my period, I never had cramps before using birth control. Other than that in happy with the patch" | 8 | November 3, 2015 | 10 | 0.179545 | 1 |
| 4 | 35696 | Buprenorphine / naloxone | Opiate Dependence | "Suboxone has completely turned my life around. I feel healthier, I'm excelling at my job and I always have money in my pocket and my savings account. I had none of those before Suboxone and spent years abusing oxycontin. My paycheck was already spent by the time I got it and I started resorting to scheming and stealing to fund my addiction. All that is history. If you're ready to stop, there's a good chance that suboxone will put you on the path of great life again. I have found the side-effects to be minimal compared to oxycontin. I'm actually sleeping better. Slight constipation is about it for me. It truly is amazing. The cost pales in comparison to what I spent on oxycontin." | 9 | November 27, 2016 | 37 | 0.194444 | 1 |
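The notebook now holds two sentiment signals per review: the TextBlob polarity and the label derived from `rating`. One way to relate them is to threshold the polarity and measure how often it agrees with the rating-based label. The sketch below does that on the five rows shown above; the helper names and the threshold of 0 are assumptions, not something the notebook defines.

```python
def polarity_to_label(polarity, threshold=0.0):
    """Map a polarity score to 1 (positive) or 0 (not positive)."""
    return 1 if polarity > threshold else 0


def agreement(polarities, labels, threshold=0.0):
    """Fraction of rows where the thresholded polarity matches the label."""
    preds = [polarity_to_label(p, threshold) for p in polarities]
    matches = sum(int(p == y) for p, y in zip(preds, labels))
    return matches / len(labels)


# Predict_Sentiment and sentiment values of the five rows displayed above
polarities = [0.000000, 0.168333, 0.067210, 0.179545, 0.194444]
labels = [1, 1, 0, 1, 1]

score = agreement(polarities, labels)
print(f"Agreement: {score:.2f}")
```

Five rows are far too few to draw conclusions from; on the full frame the same comparison could be run with `df["Predict_Sentiment"]` and `df["sentiment"]`.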
In [ ]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])
Machine Learning Model : Naive Bayes¶
In [ ]:
mnb = MultinomialNB()
mnb.fit(count_train, y_train)
with open("models/multinomial_nb_model.pkl", "wb") as file:
    pickle.dump(mnb, file)
pred = mnb.predict(count_test)
naive_bayes_score = metrics.accuracy_score(y_test, pred)
print("Accuracy : %0.3f" % naive_bayes_score)
Accuracy : 0.903
Machine Learning Model : Passive Aggressive Classifier¶
In [ ]:
passive = PassiveAggressiveClassifier()
passive.fit(count_train, y_train)
with open("models/passive_aggressive_model.pkl", "wb") as file:
    pickle.dump(passive, file)
pred = passive.predict(count_test)
pass_aggr_score = metrics.accuracy_score(y_test, pred)
print("Accuracy: %0.3f" % pass_aggr_score)
Accuracy: 0.929
Machine Learning Model : TFIDF¶
In [ ]:
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_df=0.8)
tfidf_train = tfidf_vectorizer.fit_transform(X_train)
tfidf_test = tfidf_vectorizer.transform(X_test)
pass_tf = PassiveAggressiveClassifier()
pass_tf.fit(tfidf_train, y_train)
with open("models/tfidf_model.pkl", "wb") as file:
    pickle.dump(pass_tf, file)
pred = pass_tf.predict(tfidf_test)
ml_tfidf_score = metrics.accuracy_score(y_test, pred)
print("Accuracy: %0.3f" % ml_tfidf_score)
Accuracy: 0.940
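TF-IDF replaces raw counts with weights that shrink for terms appearing in many documents. The sketch below computes those weights by hand using the smoothed formula that `TfidfVectorizer` applies by default, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The function name and toy documents are assumptions for illustration; stop-word removal and `max_df` filtering are left out.

```python
import math
from collections import Counter


def tfidf_rows(tokenized_docs):
    """Return (vocabulary, L2-normalized TF-IDF rows) for pre-tokenized documents."""
    n_docs = len(tokenized_docs)
    vocab = sorted({t for doc in tokenized_docs for t in doc})
    # Document frequency: in how many documents each term occurs
    df = Counter(t for doc in tokenized_docs for t in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

    rows = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        raw = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])
    return vocab, rows


# Toy documents: "cough" occurs everywhere, the other terms once each
docs = [["cough", "sleep"], ["cough", "pain"], ["cough", "acne"]]
vocab, rows = tfidf_rows(docs)
for term, weight in zip(vocab, rows[0]):
    print(f"{term}: {weight:.3f}")
```

In the first document, "sleep" receives a larger weight than "cough" because "cough" appears in every document and therefore discriminates less between conditions.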
TFIDF: Bigrams¶
In [ ]:
tfidf_vectorizer2 = TfidfVectorizer(stop_words='english', max_df=0.8, ngram_range=(1,2))
tfidf_train_2 = tfidf_vectorizer2.fit_transform(X_train)
tfidf_test_2 = tfidf_vectorizer2.transform(X_test)
pass_tf = PassiveAggressiveClassifier()
pass_tf.fit(tfidf_train_2, y_train)
with open("models/tfidf_bigrams_model.pkl", "wb") as file:
    pickle.dump(pass_tf, file)
with open("vectorizers/tfidf_vectorizer2.pkl", "wb") as f:
    pickle.dump(tfidf_vectorizer2, f)
pred = pass_tf.predict(tfidf_test_2)
bi_tfidf_score = metrics.accuracy_score(y_test, pred)
print("Accuracy: %0.3f" % bi_tfidf_score)
Accuracy: 0.961
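Setting `ngram_range=(1, 2)` adds every pair of adjacent tokens to the vocabulary alongside the single tokens, and `(1, 3)` adds triples as well. A minimal sketch of how such n-grams are generated from a token list follows; the helper name `ngrams` is an assumption, and the sample phrase is taken from the cleaned reviews above.

```python
def ngrams(tokens, n_min=1, n_max=2):
    """Return all contiguous n-grams with n_min <= n <= n_max, joined by spaces."""
    out = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out


tokens = "no side effect".split()
print(ngrams(tokens, 1, 2))
print(ngrams(tokens, 1, 3))
```

Phrases such as "no side" carry information that the separate tokens lose, which is one plausible reason the bigram model scores higher than the unigram one. The cost is a much larger vocabulary.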
TFIDF : Trigrams¶
In [ ]:
tfidf_vectorizer3 = TfidfVectorizer(stop_words='english', max_df=0.8, ngram_range=(1,3))  # unigrams, bigrams and trigrams
tfidf_train_3 = tfidf_vectorizer3.fit_transform(X_train)
tfidf_test_3 = tfidf_vectorizer3.transform(X_test)
pass_tf = PassiveAggressiveClassifier()
pass_tf.fit(tfidf_train_3, y_train)
with open("models/tfidf_trigrams_model.pkl", "wb") as file:
    pickle.dump(pass_tf, file)
with open("vectorizers/tfidf_vectorizer3.pkl", "wb") as f:
    pickle.dump(tfidf_vectorizer3, f)
pred = pass_tf.predict(tfidf_test_3)
tri_tfidf_score = metrics.accuracy_score(y_test, pred)
print("Accuracy: %0.3f" % tri_tfidf_score)
Accuracy: 0.962
In [ ]:
# Linear SVM on the trigram TF-IDF features built in the previous cell.
# Separate variable names keep pass_tf and tri_tfidf_score (the Passive
# Aggressive trigram model and its score) from being overwritten.
from sklearn.svm import LinearSVC
svc_tf = LinearSVC()
svc_tf.fit(tfidf_train_3, y_train)
pred = svc_tf.predict(tfidf_test_3)
svc_tfidf_score = metrics.accuracy_score(y_test, pred)
print("Accuracy: %0.3f" % svc_tfidf_score)
Accuracy: 0.962
In [ ]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False)
Out[ ]:
<Axes: >
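An annotated heatmap can be hard to read when there are many conditions, so it may help to reduce the confusion matrix to one recall value per class. The sketch below does this for a small hypothetical matrix, following scikit-learn's convention that rows are true labels and columns are predicted labels. The function name, the matrix values, and the label list are assumptions for illustration.

```python
def per_class_recall(cm, labels):
    """Recall per class: diagonal entry divided by the row total (true count)."""
    recalls = {}
    for i, label in enumerate(labels):
        support = sum(cm[i])
        recalls[label] = cm[i][i] / support if support else 0.0
    return recalls


# Hypothetical 3-class confusion matrix (rows: true, columns: predicted)
cm = [
    [8, 2, 0],
    [1, 9, 0],
    [3, 0, 2],
]
labels = ["Acne", "Birth Control", "Cough"]

recalls = per_class_recall(cm, labels)
worst = min(recalls, key=recalls.get)
for label, value in recalls.items():
    print(f"{label}: {value:.2f}")
print("Lowest recall:", worst)
```

With the real matrix, `confusion_matrix(y_test, pred).tolist()` and the sorted unique values of `y_test` could be passed in instead of the toy inputs.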
In [ ]:
import matplotlib.pyplot as plt
# Create a bar graph to compare the accuracy of each model
models = ['multinomial_nb_model', 'passive_aggressive_model', 'ml_tfidf_model', 'bi_tfidf_model', 'tri_tfidf_model']
accuracy = [naive_bayes_score, pass_aggr_score, ml_tfidf_score, bi_tfidf_score, tri_tfidf_score]
plt.figure(figsize=(8, 5))
plt.bar(models, accuracy, color='skyblue')
plt.xlabel('Models')
plt.ylabel('Accuracy')
plt.title('Accuracy of Different Models')
plt.ylim(0, 1) # Set the y-axis limits
plt.xticks(rotation=45) # Rotate x-axis labels for better readability
plt.grid(axis='y', linestyle='--', alpha=0.7) # Add horizontal grid lines
for i, v in enumerate(accuracy):
    plt.text(i, v + 0.01, str(round(v, 2)), ha='center', va='bottom', fontsize=9) # Add text labels on bars
plt.tight_layout() # Adjust layout to prevent clipping of labels
plt.show()
In [ ]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import numpy as np
# Define pipelines
# Note: every pipeline below uses TF-IDF features (without max_df), so these
# are not the exact configurations trained in the cells above.
pipeline_mnb = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('clf', MultinomialNB())
])
pipeline_passive = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('clf', PassiveAggressiveClassifier())
])
# Same configuration as pipeline_passive; any score difference reflects
# run-to-run randomness (the classifier shuffles the training data).
pipeline_ml_tfidf = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('clf', PassiveAggressiveClassifier())
])
pipeline_bi_tfidf = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english', ngram_range=(1, 2))),
    ('clf', PassiveAggressiveClassifier())
])
pipeline_tri_tfidf = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english', ngram_range=(1, 3))),
    ('clf', PassiveAggressiveClassifier())
])
# Define the metrics to be used for evaluation
scoring = {
    'accuracy': 'accuracy',
    'precision_macro': 'precision_macro',
    'recall_macro': 'recall_macro',
    'f1_macro': 'f1_macro',
}
# Perform cross-validation for each model
cv_results_mnb = cross_validate(pipeline_mnb, X_train, y_train, cv=5, scoring=scoring)
cv_results_passive = cross_validate(pipeline_passive, X_train, y_train, cv=5, scoring=scoring)
cv_results_ml_tfidf = cross_validate(pipeline_ml_tfidf, X_train, y_train, cv=5, scoring=scoring)
cv_results_bi_tfidf = cross_validate(pipeline_bi_tfidf, X_train, y_train, cv=5, scoring=scoring)
cv_results_tri_tfidf = cross_validate(pipeline_tri_tfidf, X_train, y_train, cv=5, scoring=scoring)
# Calculate mean scores for each metric
mean_scores_mnb = {metric: np.mean(cv_results_mnb[f'test_{metric}']) for metric in scoring}
mean_scores_passive = {metric: np.mean(cv_results_passive[f'test_{metric}']) for metric in scoring}
mean_scores_ml_tfidf = {metric: np.mean(cv_results_ml_tfidf[f'test_{metric}']) for metric in scoring}
mean_scores_bi_tfidf = {metric: np.mean(cv_results_bi_tfidf[f'test_{metric}']) for metric in scoring}
mean_scores_tri_tfidf = {metric: np.mean(cv_results_tri_tfidf[f'test_{metric}']) for metric in scoring}
# Print mean scores for each model
print("Mean scores for Multinomial Naive Bayes:")
print(mean_scores_mnb)
print()
print("Mean scores for Passive Aggressive:")
print(mean_scores_passive)
print()
print("Mean scores for Machine Learning TF-IDF:")
print(mean_scores_ml_tfidf)
print()
print("Mean scores for Bigram TF-IDF:")
print(mean_scores_bi_tfidf)
print()
print("Mean scores for Trigram TF-IDF:")
print(mean_scores_tri_tfidf)
Mean scores for Multinomial Naive Bayes:
{'accuracy': 0.7754161259553705, 'precision_macro': 0.9163488566352612, 'recall_macro': 0.49411498534902076, 'f1_macro': 0.5839305004348863}
Mean scores for Passive Aggressive:
{'accuracy': 0.9276892448006727, 'precision_macro': 0.907120967091099, 'recall_macro': 0.8957312216113833, 'f1_macro': 0.9011581307014828}
Mean scores for Machine Learning TF-IDF:
{'accuracy': 0.9272268710094081, 'precision_macro': 0.908351895424288, 'recall_macro': 0.8934150716746707, 'f1_macro': 0.900516054300755}
Mean scores for Bigram TF-IDF:
{'accuracy': 0.9476761547074766, 'precision_macro': 0.9353088626752282, 'recall_macro': 0.920869051753413, 'f1_macro': 0.9277062838550609}
Mean scores for Trigram TF-IDF:
{'accuracy': 0.9487948408444369, 'precision_macro': 0.9341409909874958, 'recall_macro': 0.922054979651481, 'f1_macro': 0.9276978831715997}
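The Multinomial Naive Bayes row above pairs an accuracy of about 0.78 with a macro recall of about 0.49. That pattern is consistent with a model that does well on frequent conditions and poorly on rare ones, because macro averaging weights every class equally regardless of size. The sketch below reproduces the effect on a toy imbalanced label set; the function names and the labels are assumptions for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of rows predicted correctly."""
    return sum(int(t == p) for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_recall(y_true, y_pred):
    """Unweighted mean of per-class recall."""
    per_class = []
    for c in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == c]
        hits = sum(1 for i in idx if y_pred[i] == c)
        per_class.append(hits / len(idx))
    return sum(per_class) / len(per_class)


# Toy case: the model always predicts the majority class
y_true = ["Birth Control"] * 8 + ["Cough"] * 2
y_pred = ["Birth Control"] * 10

print("accuracy    :", accuracy(y_true, y_pred))
print("macro recall:", macro_recall(y_true, y_pred))
```

Here accuracy is 0.8 while macro recall is 0.5, since the minority class is never predicted. Reporting both, as the cross-validation cell does, exposes this kind of imbalance.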
In [ ]:
# Calculate the error rate (1 - mean cross-validated accuracy) for each model
loss_mnb = 1 - mean_scores_mnb['accuracy']
loss_passive = 1 - mean_scores_passive['accuracy']
loss_ml_tfidf = 1 - mean_scores_ml_tfidf['accuracy']
loss_bi_tfidf = 1 - mean_scores_bi_tfidf['accuracy']
loss_tri_tfidf = 1 - mean_scores_tri_tfidf['accuracy']
# Print loss for each model
print("Loss for Multinomial Naive Bayes:", loss_mnb)
print("Loss for Passive Aggressive:", loss_passive)
print("Loss for Machine Learning TF-IDF:", loss_ml_tfidf)
print("Loss for Bigram TF-IDF:", loss_bi_tfidf)
print("Loss for Trigram TF-IDF:", loss_tri_tfidf)
Loss for Multinomial Naive Bayes: 0.22458387404462954
Loss for Passive Aggressive: 0.07231075519932728
Loss for Machine Learning TF-IDF: 0.07277312899059185
Loss for Bigram TF-IDF: 0.05232384529252343
Loss for Trigram TF-IDF: 0.05120515915556312
In [ ]:
'''
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
# Load dataset
data = fetch_20newsgroups(subset='all', shuffle=True, random_state=42)
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2, random_state=42)
# Define models
models = {
    "LogisticRegression": LogisticRegression(),
    "RidgeClassifier": RidgeClassifier(),
    "MultinomialNB": MultinomialNB(),
    "LinearSVC": LinearSVC()
}
# Vectorize data
vectorizer = CountVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)
# Results list to store model metrics
results = []
# Iterate over models
for model_name, model in models.items():
    # Train model
    model.fit(X_train_vec, y_train)
    # Test model
    y_pred = model.predict(X_test_vec)
    # Calculate metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, average='weighted')
    recall = recall_score(y_test, y_pred, average='weighted')
    f1 = f1_score(y_test, y_pred, average='weighted')
    # Append results
    results.append([model_name, precision, recall, f1, accuracy])
# Convert results to DataFrame
results_df = pd.DataFrame(results, columns=["Model", "Precision", "Recall", "F1", "Accuracy"])
# Print table
print("TABLE BAG-OF-WORDS")
print(results_df)
'''
In [ ]:
# mnb: Multinomial Naive Bayes model
# passive: Passive Aggressive Classifier model
# Train the models
mnb.fit(count_train, y_train)
passive.fit(count_train, y_train)
# Get predictions for both training and test sets
train_pred_mnb = mnb.predict(count_train)
test_pred_mnb = mnb.predict(count_test)
train_pred_passive = passive.predict(count_train)
test_pred_passive = passive.predict(count_test)
# Calculate the evaluation metrics for both training and test sets
train_accuracy_mnb = metrics.accuracy_score(y_train, train_pred_mnb)
test_accuracy_mnb = metrics.accuracy_score(y_test, test_pred_mnb)
train_accuracy_passive = metrics.accuracy_score(y_train, train_pred_passive)
test_accuracy_passive = metrics.accuracy_score(y_test, test_pred_passive)
# Plot the results
plt.figure(figsize=(10, 5))
# Multinomial Naive Bayes
plt.subplot(1, 2, 1)
plt.bar(['Train Accuracy', 'Test Accuracy'], [train_accuracy_mnb, test_accuracy_mnb], color=['blue', 'orange'])
plt.ylim(0, 1)
plt.title('Multinomial Naive Bayes')
# Passive Aggressive Classifier
plt.subplot(1, 2, 2)
plt.bar(['Train Accuracy', 'Test Accuracy'], [train_accuracy_passive, test_accuracy_passive], color=['blue', 'orange'])
plt.ylim(0, 1)
plt.title('Passive Aggressive Classifier')
plt.suptitle('Training vs Test Accuracy')
plt.show()
Drug Recommendation¶
In [ ]:
df_drug = df[(df['rating']>=9)&(df['usefulCount']>=100)].sort_values(by = ['rating', 'usefulCount'], ascending = [False, False])
In [ ]:
def recommend_drug(disease):
    recommended_drug_list = df_drug[df_drug['condition']==disease]['drugName'].head(3).tolist()
    return recommended_drug_list
In [ ]:
recommend_drug("GERD")
Out[ ]:
['Zantac 150', 'Ranitidine', 'Zantac']
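Because `df_drug` holds one row per review, taking `head(3)` can return the same `drugName` more than once when a drug has several highly rated reviews. The sketch below shows one way to keep only distinct names while preserving the rating and `usefulCount` ordering. It works on plain dictionaries so that it runs on its own; the function name and the toy reviews are hypothetical.

```python
def top_unique_drugs(reviews, disease, k=3):
    """Return up to k distinct drug names for a condition, best-rated first."""
    ranked = sorted(
        (r for r in reviews if r["condition"] == disease),
        key=lambda r: (r["rating"], r["usefulCount"]),
        reverse=True,
    )
    seen, result = set(), []
    for r in ranked:
        if r["drugName"] not in seen:
            seen.add(r["drugName"])
            result.append(r["drugName"])
        if len(result) == k:
            break
    return result


# Hypothetical reviews; "DrugA" appears twice for the same condition
reviews = [
    {"drugName": "DrugA", "condition": "GERD", "rating": 10, "usefulCount": 300},
    {"drugName": "DrugA", "condition": "GERD", "rating": 10, "usefulCount": 250},
    {"drugName": "DrugB", "condition": "GERD", "rating": 10, "usefulCount": 120},
    {"drugName": "DrugC", "condition": "GERD", "rating": 9, "usefulCount": 400},
    {"drugName": "DrugD", "condition": "Acne", "rating": 10, "usefulCount": 500},
]
print(top_unique_drugs(reviews, "GERD"))
```

This only removes exact name repeats. Brand and generic names of the same drug, such as Zantac and Ranitidine in the output above, would still be listed separately.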
Predictions¶
In [ ]:
# df_test = df.groupby(['condition','drugName']).agg({'total_pred' : ['mean']})
# df_test
In [ ]:
# tfidf_trigrams_model has the highest accuracy
with open('models/tfidf_trigrams_model.pkl', 'rb') as f:
    tfidf_trigrams_model = pickle.load(f)
with open('vectorizers/tfidf_vectorizer3.pkl', 'rb') as f:
    tfidf_vectorizer = pickle.load(f)
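The model and its vectorizer are stored in separate files, so loading a mismatched pair (for example the trigram model with the bigram vectorizer) is an easy mistake to make. One possible safeguard, sketched below, is to pickle both objects together as a single bundle. The helper names are assumptions, and plain dictionaries stand in for the fitted objects so the example runs without the trained artifacts.

```python
import io
import pickle


def save_bundle(fileobj, vectorizer, model):
    """Pickle a vectorizer and the model trained on its features together."""
    pickle.dump({"vectorizer": vectorizer, "model": model}, fileobj)


def load_bundle(fileobj):
    """Return the (vectorizer, model) pair stored by save_bundle."""
    bundle = pickle.load(fileobj)
    return bundle["vectorizer"], bundle["model"]


# Stand-in objects (hypothetical); an in-memory buffer replaces a real file
buffer = io.BytesIO()
save_bundle(buffer, {"ngram_range": (1, 3)}, {"name": "passive_aggressive"})
buffer.seek(0)
vec, mdl = load_bundle(buffer)
print(vec, mdl)
```

In the notebook the same helpers could be called with `tfidf_vectorizer3`, `pass_tf`, and a file opened in binary mode.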
In [ ]:
text = ["Increased thirst. Frequent urination. Increased hunger. Unintended weight loss. Fatigue. Blurred vision. Slow-healing sores. Frequent infections."]
text_transformed = tfidf_vectorizer.transform(text)
prediction = tfidf_trigrams_model.predict(text_transformed)[0]
print("Predicted disease :", prediction)
recommend_drug(prediction)
Predicted disease : Diabetes, Type 2
Out[ ]:
['Victoza', 'Liraglutide', 'Canagliflozin']
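The prediction cells in this section repeat the same three steps: transform the text, predict the condition, and look up drugs. A small helper can wrap them, as in the sketch below. The function name `predict_and_recommend` and the two stand-in classes are hypothetical and exist only so the example runs without the pickled model.

```python
def predict_and_recommend(symptoms, vectorizer, model, recommender):
    """Vectorize one symptom description, predict a condition, look up drugs."""
    features = vectorizer.transform([symptoms])
    condition = model.predict(features)[0]
    return condition, recommender(condition)


# Hypothetical stand-ins for the fitted vectorizer and classifier
class EchoVectorizer:
    def transform(self, texts):
        return [text.lower() for text in texts]


class KeywordModel:
    def predict(self, features):
        return ["GERD" if "heartburn" in f else "Unknown" for f in features]


condition, drugs = predict_and_recommend(
    "Heartburn after eating",
    EchoVectorizer(),
    KeywordModel(),
    lambda c: {"GERD": ["Zantac 150", "Ranitidine", "Zantac"]}.get(c, []),
)
print("Predicted disease :", condition)
print(drugs)
```

With the real objects the call would be `predict_and_recommend(text, tfidf_vectorizer, tfidf_trigrams_model, recommend_drug)`, where `text` is a single string rather than a list.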
In [ ]:
text = ["Difficulty falling asleep at night. Waking up during the night. Waking up too early. Not feeling well-rested after a night's sleep. Daytime tiredness or sleepiness. Irritability, depression or anxiety. Difficulty paying attention, focusing on tasks or remembering. Increased errors or accidents."]
text_transformed = tfidf_vectorizer.transform(text)
prediction = tfidf_trigrams_model.predict(text_transformed)[0]
print("Predicted disease :", prediction)
recommend_drug(prediction)
Predicted disease : Insomnia
Out[ ]:
['Trazodone', 'Clonazepam', 'Remeron']
In [ ]:
text = ["Crusting of skin bumps. Cysts. Papules (small red bumps) Pustules (small red bumps containing white or yellow pus) Redness around the skin eruptions. Scarring of the skin. Whiteheads. Blackheads."]
text_transformed = tfidf_vectorizer.transform(text)
prediction = tfidf_trigrams_model.predict(text_transformed)[0]
print("Predicted disease :", prediction)
recommend_drug(prediction)
Predicted disease : Acne
Out[ ]:
['Tretinoin', 'Spironolactone', 'Differin']
In [ ]:
text = ["Spotting between periods. Breakthrough bleeding, or spotting, refers to when vaginal bleeding occurs between menstrual cycles. Nausea. Breast tenderness. Headaches and migraine. Weight gain. Mood changes. Missed periods. Decreased libido."]
text_transformed = tfidf_vectorizer.transform(text)
prediction = tfidf_trigrams_model.predict(text_transformed)[0]
print("Predicted disease :", prediction)
recommend_drug(prediction)
Predicted disease : Birth Control
Out[ ]:
['Mirena', 'Levonorgestrel', 'Implanon']
In [ ]:
text = ["A burning sensation in your chest (heartburn), usually after eating, which might be worse at night or while lying down. Backwash (regurgitation) of food or sour liquid. Upper abdominal or chest pain. Trouble swallowing (dysphagia) Sensation of a lump in your throat."]
text_transformed = tfidf_vectorizer.transform(text)
prediction = tfidf_trigrams_model.predict(text_transformed)[0]
print("Predicted disease :", prediction)
recommend_drug(prediction)
Predicted disease : GERD
Out[ ]:
['Zantac 150', 'Ranitidine', 'Zantac']